In this lab, you will practice methods for dealing with missing data when using lavaan to fit latent variable models.
For this lab, we will go back to working with the Eating Attitudes data from Lab 4. These data are available as eating_attitudes.rds. Unlike the complete version that we analyzed in Lab 4, we will now work with the incomplete version that is actually analyzed in Enders (2010).
As before, this dataset includes 400 observations of the following 14 variables. Note that the variables are listed in the order that they appear in the dataset.

- id: A numeric ID
- eat1:eat24: Seven indicators of a Drive for Thinness construct
- eat3:eat21: Three indicators of a Preoccupation with Food construct
- bmi: Body mass index
- wsb: A single item assessing Western Standards of Beauty
- anx: A single item assessing Anxiety Level

You can download the original data here, and you can access the code used to process the data here.
Read in the eating_attitudes.rds dataset.
dataDir <- "../../data/"
ea <- readRDS(paste0(dataDir, "eating_attitudes.rds"))
Summarize the EA data to get a sense of their characteristics.
head(ea)
summary(ea)
id eat1 eat2 eat10
Min. : 1.0 Min. :2.000 Min. :1.000 Min. :1.000
1st Qu.:100.8 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
Median :200.5 Median :4.000 Median :4.000 Median :4.000
Mean :200.5 Mean :3.987 Mean :3.939 Mean :3.938
3rd Qu.:300.2 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :400.0 Max. :7.000 Max. :7.000 Max. :7.000
NA's :27 NA's :21 NA's :30
eat11 eat12 eat14 eat24
Min. :1.000 Min. :2.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
Median :4.000 Median :4.000 Median :4.000 Median :4.000
Mean :3.938 Mean :3.905 Mean :3.962 Mean :3.954
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
NA's :33 NA's :33
eat3 eat18 eat21 bmi
Min. :1.000 Min. :2.000 Min. :1.000 Min. :15.23
1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:20.43
Median :4.000 Median :4.000 Median :4.000 Median :22.42
Mean :3.967 Mean :3.951 Mean :3.953 Mean :22.39
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:24.21
Max. :7.000 Max. :7.000 Max. :7.000 Max. :31.52
NA's :32 NA's :19 NA's :5
wsb anx
Min. : 4.000 Min. : 4.00
1st Qu.: 8.000 1st Qu.:10.00
Median : 9.000 Median :12.00
Mean : 8.946 Mean :11.94
3rd Qu.:10.000 3rd Qu.:14.00
Max. :14.000 Max. :20.00
NA's :27 NA's :18
str(ea)
'data.frame': 400 obs. of 14 variables:
$ id : num 1 2 3 4 5 6 7 8 9 10 ...
$ eat1 : num 4 6 3 3 3 4 5 4 4 6 ...
$ eat2 : num 4 5 3 3 2 5 4 3 7 5 ...
$ eat10: num 4 6 2 4 3 4 4 4 6 5 ...
$ eat11: num 4 6 2 3 3 5 4 4 5 5 ...
$ eat12: num 4 6 3 4 3 4 4 4 4 6 ...
$ eat14: num 4 7 2 4 3 4 4 4 6 6 ...
$ eat24: num 3 6 3 3 3 4 4 4 4 5 ...
$ eat3 : num 4 5 3 3 4 4 3 6 4 5 ...
$ eat18: num 5 6 3 5 4 5 3 6 4 6 ...
$ eat21: num 4 5 2 4 4 4 3 5 4 5 ...
$ bmi : num 18.9 26 18.3 18.2 24.4 ...
$ wsb : num 9 13 6 5 10 7 11 8 10 12 ...
$ anx : num 11 19 8 14 7 11 12 12 14 12 ...
Before we can make any headway with missing data treatment or substantive analyses, we need to get a handle on the extent of our missing data problem. So, we will begin with some descriptive analysis of the missing data.
Calculate the proportion of missing data for each variable.
library(dplyr)
## Calculate variablewise proportions missing:
ea %>% is.na() %>% colMeans()
id eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21
0.0000 0.0675 0.0525 0.0750 0.0000 0.0825 0.0000 0.0825 0.0000 0.0800 0.0475
bmi wsb anx
0.0125 0.0675 0.0450
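The variablewise proportions don't tell us how the missingness is spread over cases. As a quick complementary check, we can also compute casewise missingness in base R (this sketch assumes the ea data frame loaded above):

```r
## Calculate casewise proportions missing:
caseMiss <- rowMeans(is.na(ea))

## How many cases are completely observed?
sum(caseMiss == 0)

## Summarize the distribution of casewise missingness:
summary(caseMiss)
```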
Calculate the covariance coverage rates.
HINT: We can use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.

library(mice) # Provides md.pairs()
library(magrittr) # Provides the extract2() alias function
## Calculate covariance coverage:
cc <- (ea %>% select(-id) %>% md.pairs() %>% extract2("rr")) / nrow(ea)
cc %>% round(2)
eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi wsb anx
eat1 0.93 0.89 0.87 0.93 0.86 0.93 0.86 0.93 0.86 0.89 0.92 0.88 0.90
eat2 0.89 0.95 0.87 0.95 0.87 0.95 0.88 0.95 0.88 0.90 0.94 0.88 0.90
eat10 0.87 0.87 0.92 0.92 0.86 0.92 0.86 0.92 0.86 0.88 0.92 0.86 0.88
eat11 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
eat12 0.86 0.87 0.86 0.92 0.92 0.92 0.85 0.92 0.86 0.87 0.91 0.87 0.87
eat14 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
eat24 0.86 0.88 0.86 0.92 0.85 0.92 0.92 0.92 0.86 0.88 0.90 0.85 0.88
eat3 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
eat18 0.86 0.88 0.86 0.92 0.86 0.92 0.86 0.92 0.92 0.88 0.91 0.86 0.88
eat21 0.89 0.90 0.88 0.95 0.87 0.95 0.88 0.95 0.88 0.95 0.94 0.89 0.91
bmi 0.92 0.94 0.92 0.99 0.91 0.99 0.90 0.99 0.91 0.94 0.99 0.92 0.94
wsb 0.88 0.88 0.86 0.93 0.87 0.93 0.85 0.93 0.86 0.89 0.92 0.93 0.89
anx 0.90 0.90 0.88 0.96 0.87 0.96 0.88 0.96 0.88 0.91 0.94 0.89 0.96
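If you're curious what md.pairs() is computing, the "rr" slot is just a cross-product of the response indicator matrix. The following base-R sketch should reproduce the coverage matrix above (assuming the ea and cc objects already exist in the session):

```r
## Flag the observed values:
r <- !is.na(ea[ , -1]) # drop the 'id' column

## Count the jointly observed cases for every pair of variables:
jointObs <- crossprod(r)

## Convert the counts to coverage proportions:
cc2 <- jointObs / nrow(ea)

## Compare to the md.pairs()-based result:
all.equal(cc2, cc, check.attributes = FALSE)
```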
Summarize the coverages from 1.2.2.
Covariance coverage matrices are often very large and, hence, difficult to parse. It can be useful to distill this information into a few succinct summaries to help extract the useful knowledge.
One of the most useful numeric summaries is the range. We’ll start there.
## Range of coverages < 1.0:
## NOTE: Use lower.tri() to select only the elements below the diagonal.
uniqueCc <- cc[lower.tri(cc)]
uniqueCc[uniqueCc < 1] %>% range()
[1] 0.8500 0.9875
Sometimes, we may want to count the number of coverages that satisfy some condition (e.g., fall below some threshold). When doing such a summary, we need to remember two idiosyncrasies of the covariance coverage matrix:

- The matrix is symmetric, so every off-diagonal coverage appears twice.
- The diagonal contains the univariate coverages (i.e., each variable's proportion of observed cases), which are not pairwise quantities.

So, to get a count of covariance coverages without double counting, we consider only the elements from the lower (or upper) triangle.
## Count the number of coverages lower than 0.9:
sum(uniqueCc < 0.9)
[1] 32
Visualize the covariance coverage rates from 1.2.2.
As with numeric summaries, visualizations are also a good way to distill meaningful knowledge from the raw information in a covariance coverage matrix.
We’ll try two different visualizations.
First, we’ll create a simple histogram of the coverages to get a sense of their distribution. To see why such a visualization is useful, consider two hypothetical situations wherein the range of coverages is [0.2; 0.8]:

- In the first situation, most of the coverages cluster near the lower bound of 0.2.
- In the second situation, most of the coverages cluster near the upper bound of 0.8, with only a few low values.

Clearly, the second situation is less problematic, but we cannot differentiate between these two simply by examining the range of coverages.
library(ggplot2)
## Simple histogram of coverages:
data.frame(Coverage = uniqueCc) %>%
ggplot(aes(Coverage)) +
geom_histogram(bins = 15, color = "black")
Next, we’ll create a heatmap of the coverage matrix itself. Such a visualization could help us locate clusters of variables with especially low coverages, for example.
## Convert the coverage matrix into a plottable, tidy-formatted data.frame:
pDat <- data.frame(Coverage = as.numeric(cc),
x = rep(colnames(cc), ncol(cc)),
y = rep(colnames(cc), each = ncol(cc))
)
## Create the heatmap via the "tile" geom:
ggplot(pDat, aes(x = x, y = y, fill = Coverage)) +
geom_tile() +
scale_x_discrete(limits = rev(colnames(cc))) +
scale_y_discrete(limits = colnames(cc)) +
scale_fill_distiller(palette = "Oranges") +
xlab(NULL) +
ylab(NULL)
Visualize the missing data patterns.
HINT:

- The plot_pattern() function from ggmice will create a nice visualization of the patterns.
- The md.pattern() function from mice will create a (somewhat less beautiful) visualization but will return a numeric pattern matrix that you can further analyze.

library(ggmice)
## Visualize the response patterns:
plot_pattern(ea)
## Create an analyzable pattern matrix (without visualization):
(pats <- md.pattern(ea, plot = FALSE))
id eat11 eat14 eat3 bmi anx eat21 eat2 eat1 wsb eat10 eat18 eat12 eat24
242 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
12 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
8 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 2
3 1 1 1 1 1 1 1 1 1 1 1 0 0 1 2
2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 3
14 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 0 1 1 0 2
1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 2
2 1 1 1 1 1 1 1 1 1 1 0 0 1 1 2
2 1 1 1 1 1 1 1 1 1 1 0 0 0 1 3
12 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 0 1 1 0 1 2
3 1 1 1 1 1 1 1 1 1 0 1 0 0 1 3
1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 3
5 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 2
2 1 1 1 1 1 1 1 1 0 1 1 1 0 1 2
2 1 1 1 1 1 1 1 1 0 1 1 1 0 0 3
2 1 1 1 1 1 1 1 1 0 1 0 1 1 0 3
3 1 1 1 1 1 1 1 1 0 0 1 1 1 1 2
2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 3
9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 2
1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 2
2 1 1 1 1 1 1 1 0 1 1 1 0 1 1 2
3 1 1 1 1 1 1 1 0 1 1 1 0 1 0 3
1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 2
3 1 1 1 1 1 1 1 0 0 1 1 1 1 1 2
13 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 2
1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 3
1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 3
1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 3
1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 4
10 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 2
4 1 1 1 1 1 0 1 1 0 1 1 1 1 1 2
1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 3
1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 2
1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 2
3 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 3
1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 3
0 0 0 0 5 18 19 21 27 27 30 32 33 33 245
## Count the number of unique response patterns:
nrow(pats) - 1
[1] 46
As shown by the above code, there are 46 unique response patterns in the EA data.
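Because md.pattern() stores the pattern frequencies in the row names of its output, we can take the "further analyze" idea a bit further. For example, a small sketch using the pats matrix from above:

```r
## Extract the pattern frequencies (dropping the final row of column totals):
counts <- as.numeric(head(rownames(pats), -1))

## Proportion of completely observed cases (the all-ones pattern in row 1):
counts[1] / sum(counts)

## The last column gives the number of missing variables in each pattern:
nMiss <- as.numeric(pats[-nrow(pats), ncol(pats)])

## Proportion of cases missing more than two variables:
sum(counts[nMiss > 2]) / sum(counts)
```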
Calculate the fraction of missing information for the summary statistics.
The fraction of missing information (FMI) is a crucial statistic for quantifying the extent to which missing data (and their treatment) affect parameter estimates. We generally calculate the FMI for the estimates of our substantive model parameters, but we can also use the FMI as a descriptive tool before fitting any substantive model.
Using the fmi() function from semTools, we can efficiently compute the FMI for sufficient statistics using the FIML-based approach described by Savalei and Rhemtulla (2012).

library(semTools)
fmi <- ea %>% select(-id) %>% fmi()
fmi$Covariances$fmi
eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi wsb anx
eat1 0.063
eat2 0.067 0.030
eat10 0.073 0.032 0.043
eat11 0.029 0.014 0.023 0.000
eat12 0.090 0.056 0.072 0.061 0.080
eat14 0.031 0.008 0.019 0.000 0.040 0.000
eat24 0.096 0.088 0.064 0.062 0.071 0.055 0.087
eat3 0.050 0.018 0.025 0.000 0.075 0.000 0.077 0.000
eat18 0.101 0.068 0.065 0.049 0.074 0.041 0.116 0.037 0.071
eat21 0.052 0.044 0.048 0.023 0.080 0.016 0.091 0.017 0.062 0.044
bmi 0.059 0.032 0.035 0.006 0.058 0.004 0.090 0.010 0.063 0.033 0.011
wsb 0.083 0.096 0.097 0.035 0.127 0.061 0.093 0.030 0.091 0.078 0.064 0.067
anx 0.076 0.050 0.039 0.025 0.089 0.037 0.085 0.042 0.084 0.065 0.039 0.093 0.056
fmi$Means %>% select(variable, fmi) %>% print(digits = 3)
variable fmi
92 eat1 0.038
93 eat2 0.030
94 eat10 0.032
95 eat11 0.000
96 eat12 0.052
97 eat14 0.000
98 eat24 0.053
99 eat3 0.000
100 eat18 0.046
101 eat21 0.023
102 bmi 0.010
103 wsb 0.060
104 anx 0.028
Of course, we can estimate models in lavaan without doing anything fancy to the missing data. If we fit a model to incomplete data using lavaan, the software will automatically apply listwise deletion to remove any missing data before estimating the model.
Although you should almost never use deletion-based treatments, we’ll start our modeling exercises by fitting a model using the default complete case analysis. Sometimes, something goes wrong with the modeling, and you end up deleting cases accidentally. If you’re keyed into the signs that listwise deletion has occurred, you’ll know what to check when working with these methods in the wild.
Define the model syntax for the CFA.
The data only contain multi-item scales for Drive for Thinness and Preoccupation with Food, so we only need a two-dimensional CFA evaluating the measurement structure of these two factors.
- Drive for Thinness: eat1, eat2, eat10, eat11, eat12, eat14, eat24
- Preoccupation with Food: eat3, eat18, eat21

cfaMod <- '
drive =~ eat1 + eat2 + eat10 + eat11 + eat12 + eat14 + eat24
pre =~ eat3 + eat18 + eat21
'
Estimate the CFA model on the EA data.
naiveCfa <- cfa(cfaMod, data = ea, std.lv = TRUE, meanstructure = TRUE)
Summarize the fitted CFA and check the model fit.
summary(naiveCfa)
lavaan 0.6-12 ended normally after 25 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 31
Used Total
Number of observations 267 400
Model Test User Model:
Test statistic 57.511
Degrees of freedom 34
P-value (Chi-square) 0.007
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
drive =~
eat1 0.718 0.057 12.524 0.000
eat2 0.661 0.053 12.391 0.000
eat10 0.814 0.051 15.913 0.000
eat11 0.751 0.047 15.975 0.000
eat12 0.668 0.055 12.207 0.000
eat14 0.943 0.050 18.852 0.000
eat24 0.628 0.055 11.462 0.000
pre =~
eat3 0.763 0.053 14.368 0.000
eat18 0.698 0.055 12.659 0.000
eat21 0.879 0.052 16.829 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
drive ~~
pre 0.647 0.044 14.824 0.000
Intercepts:
Estimate Std.Err z-value P(>|z|)
.eat1 3.884 0.064 60.958 0.000
.eat2 3.831 0.059 64.776 0.000
.eat10 3.884 0.061 63.654 0.000
.eat11 3.835 0.056 68.305 0.000
.eat12 3.888 0.060 64.296 0.000
.eat14 3.880 0.064 61.029 0.000
.eat24 3.884 0.060 65.144 0.000
.eat3 3.884 0.059 65.403 0.000
.eat18 3.895 0.060 65.119 0.000
.eat21 3.869 0.061 63.527 0.000
drive 0.000
pre 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.eat1 0.568 0.053 10.710 0.000
.eat2 0.497 0.046 10.736 0.000
.eat10 0.332 0.034 9.657 0.000
.eat11 0.278 0.029 9.626 0.000
.eat12 0.530 0.049 10.770 0.000
.eat14 0.191 0.027 7.164 0.000
.eat24 0.555 0.051 10.896 0.000
.eat3 0.359 0.043 8.307 0.000
.eat18 0.468 0.049 9.615 0.000
.eat21 0.217 0.042 5.110 0.000
drive 1.000
pre 1.000
fitMeasures(naiveCfa)
npar fmin chisq df
31.000 0.108 57.511 34.000
pvalue baseline.chisq baseline.df baseline.pvalue
0.007 1504.751 45.000 0.000
cfi tli nnfi rfi
0.984 0.979 0.979 0.949
nfi pnfi ifi rni
0.962 0.727 0.984 0.984
logl unrestricted.logl aic bic
-3027.345 -2998.589 6116.690 6227.895
ntotal bic2 rmsea rmsea.ci.lower
267.000 6129.606 0.051 0.027
rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
0.073 0.447 0.036 0.039
srmr srmr_bentler srmr_bentler_nomean crmr
0.037 0.037 0.040 0.040
crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
0.045 0.037 0.040 226.640
cn_01 gfi agfi pgfi
261.267 0.995 0.990 0.520
mfi ecvi
0.957 0.448
This model looks fine. All measurement model parameters seem reasonable, and the model fits the data well. That being said, 133 observations were deleted to “fix” the missing data. Also, we can be reasonably certain that these estimates are biased, unless the data were all missing completely at random (MCAR). Even though the estimates look “good”, they can still be wrong.
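One reliable sign of accidental deletion is a mismatch between the original and analyzed sample sizes, both of which lavaan records in the fitted object. A small sketch of this check (assuming the naiveCfa fit from above):

```r
library(lavaan)

## Compare the analyzed N to the original N:
nUsed <- lavInspect(naiveCfa, "nobs")
nOrig <- lavInspect(naiveCfa, "norig")

## Any positive difference means cases were listwise deleted:
nOrig - nUsed

if(nUsed < nOrig)
  warning("Cases were deleted. Did you really want listwise deletion?")
```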
Now, we’ll get into the real meat of the lab and move into the realm of recommended approaches. Specifically, we will treat the missing data via principled methods that correct the missing data problem through a theoretically motivated model for the missing values.
The first such approach we will consider is multiple imputation (MI). MI is a pre-processing method that treats the missing data in a separate step prior to estimating the analysis model. Consequently, the first step in our journey toward an MI-based analysis will be multiply imputing the missing data.
Multiply impute the missing data.
Use the mice::mice() function to impute the missing data using the following settings.

- Generate 20 imputations (m = 20).
- Run the algorithm for 25 iterations (maxit = 25).
- Impute every incomplete variable via predictive mean matching (method = "pmm").
- Set the random number seed to 235711 for replicability.
## Generate 20 imputations of the missing data:
miceOut <- mice(ea, m = 20, maxit = 25, method = "pmm", seed = 235711)
Create a list of imputed datasets from the mids object generated in 2.1.1.
The mice() function returns a mids object that contains the imputations and a bunch of other metadata but not the imputed datasets. Use the mice::complete() function to create the final imputed datasets.

## Replace the missing values with the imputations (and return a list of datasets):
eaImp <- complete(miceOut, "all")
MI operates by filling the missing values with samples from the posterior predictive distribution (PPD) of the missing data. This PPD is essentially just a distribution of predictions generated from some Bayesian regression model. These models are usually estimated via Markov Chain Monte Carlo (MCMC) methods.
MCMC is a stochastic estimation procedure that estimates a probabilistic model by randomly sampling from the joint distribution of the model parameters (the so-called posterior distribution in Bayesian models) to build up an empirical representation of this distribution. When using MCMC, the estimated model (and, in our case, the imputations) is only valid if the MCMC algorithm has converged.
Therefore, before we can proceed with the analysis, we need to check two important criteria:

- Has the MCMC algorithm that generated the imputations converged?
- Are the imputed values themselves plausible?
For more information on MCMC estimation, convergence, and diagnostics for MI models, see Sections 4.5, 6.5, and 6.6 of Van Buuren (2019).
Create traceplots of the imputations’ means and variances.
One especially expedient way to evaluate convergence is by plotting the iteration history of the means and variances of the imputed values for each variable. Once the MCMC algorithm has converged, the individual lines in these plots (each of which corresponds to one imputation and represents an independent Markov Chain) should mix thoroughly, and the overall mass of the tracelines should show no trend.
Use the plot_trace() function from ggmice to create traceplots for each imputed variable.

## Get the names of the imputed variables:
targets <- which(miceOut$method != "") %>% names()
## Create traceplots of the imputed means and variances:
for(v in targets)
plot_trace(miceOut, v) %>% print()
The plots show well-mixed tracelines for each imputation, and there is no evidence of a global trend. So, these plots suggest convergence.
We can also evaluate convergence via numeric statistics such as the Potential Scale Reduction Factor, \(\hat{R}\), (Gelman & Rubin, 1992).
Compute the \(\hat{R}\) statistic for the means and variances of the imputations.
For mids objects, we can use the Rhat.mice() function from
miceadds to compute the \(\hat{R}\) statistics for the means and
variances of the imputed data.
library(miceadds)
## Compute the PSR factors for the means and variances of the imputed data:
Rhat.mice(miceOut)
variable MissProp Rhat.M.imp Rhat.Var.imp
1 id 0.00 NA NA
2 eat1 6.75 1.0054048 1.0012331
3 eat2 5.25 1.0073912 1.0177948
4 eat10 7.50 1.0111193 0.9995807
5 eat11 0.00 NA NA
6 eat12 8.25 1.0096568 1.0105978
7 eat14 0.00 NA NA
8 eat24 8.25 0.9953888 0.9972065
9 eat3 0.00 NA NA
10 eat18 8.00 1.0024234 1.0000535
11 eat21 4.75 1.0013914 1.0022511
12 bmi 1.25 0.9993516 1.0126109
13 wsb 6.75 1.0043349 1.0017893
14 anx 4.50 0.9998069 0.9952062
Yes, all \(\hat{R}\) statistics are less than 1.1 (and very near 1.0), so these statistics suggest convergence.
After we confirm that the imputation model has converged, we still need to sanity check the imputed values themselves. A convergent imputation model can still produce completely ridiculous and useless imputations. The fact that the MCMC algorithm has converged onto a stationary distribution does not mean that this distribution represents a sensible imputation model.
One of the best ways to check the plausibility of the imputations is with plots that juxtapose the imputed and observed data. Since we’re comparing distributions of numeric variables, kernel density plots are a good option.
Create kernel density plots of the imputed and observed data.
We can create “quick-and-dirty” density plots via the mice::densityplot()
function, but ggmice::ggmice() allows us to create the same figures with
all the control provided by ggplot(). For more details check
this page.
for(v in targets) {
p <- ggmice(miceOut, aes_string(v, group = ".imp", size = ".where")) +
geom_density() +
scale_size_manual(values = c("observed" = 1, "imputed" = 0.5),
guide = "none") +
theme(legend.position = "none")
print(p)
}
Yes, the imputed values all seem plausible. The kernel densities for the imputed data all seem as though they could represent samples from a population similar to the one that generated the complete data.
Assuming our imputations pass the above checks, it’s time to move into the analysis phase of the process. In practice, we will lean heavily on routines from the semTools package to implement both the analysis and pooling phases.
Specifically, we will use semTools wrappers around lavaan.mi() to fit our models to the list of multiply imputed datasets from 2.1.2.

Estimate the CFA defined in 1.3.1 on the multiply imputed data.
HINT: The semTools::cfa.mi() function can be a big help here.
miCfa <- cfa.mi(cfaMod, data = eaImp, std.lv = TRUE, meanstructure = TRUE)
Summarize the fitted CFA and check the model fit.
summary(miCfa, fmi = TRUE) %>% print()
The RIV will exceed 1 whenever between-imputation variance exceeds within-imputation variance (when FMI(1) > 50%).
lavaan.mi object based on 20 imputed data sets.
See class?lavaan.mi help page for available methods.
Convergence information:
The model converged on 20 imputed data sets
Rubin's (1987) rules were used to pool point and SE estimates across 20 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err t-value df P(>|t|) FMI RIV
drive =~
eat1 0.731 0.051 14.419 5028.239 0.000 0.061 0.065
eat2 0.652 0.047 13.982 Inf 0.000 0.018 0.018
eat10 0.799 0.044 18.377 2831.970 0.000 0.082 0.089
eat11 0.764 0.041 18.472 Inf 0.000 0.008 0.008
eat12 0.662 0.047 14.032 3424.716 0.000 0.074 0.080
eat14 0.901 0.043 20.951 Inf 0.000 0.005 0.005
eat24 0.615 0.048 12.713 2035.951 0.000 0.097 0.107
pre =~
eat3 0.774 0.048 16.223 Inf 0.000 0.038 0.040
eat18 0.754 0.049 15.455 1637.941 0.000 0.108 0.121
eat21 0.860 0.046 18.673 1924.002 0.000 0.099 0.110
Covariances:
Estimate Std.Err t-value df P(>|t|) FMI RIV
drive ~~
pre 0.620 0.040 15.578 7586.696 0.000 0.050 0.053
Intercepts:
Estimate Std.Err t-value df P(>|t|) FMI RIV
.eat1 4.012 0.056 71.848 Inf 0.000 0.043 0.045
.eat2 3.939 0.051 77.279 Inf 0.000 0.019 0.020
.eat10 3.948 0.051 76.769 8365.873 0.000 0.048 0.050
.eat11 3.938 0.049 80.352 Inf 0.000 0.000 0.000
.eat12 3.925 0.052 76.052 5793.515 0.000 0.057 0.061
.eat14 3.962 0.053 74.316 Inf 0.000 0.000 0.000
.eat24 3.986 0.052 76.893 4440.152 0.000 0.065 0.070
.eat3 3.968 0.052 75.674 Inf 0.000 0.000 0.000
.eat18 3.983 0.053 75.128 2448.087 0.000 0.088 0.097
.eat21 3.949 0.052 75.410 Inf 0.000 0.025 0.026
drive 0.000
pre 0.000
Variances:
Estimate Std.Err t-value df P(>|t|) FMI RIV
.eat1 0.613 0.049 12.466 2604.622 0.000 0.085 0.093
.eat2 0.532 0.042 12.557 2021.696 0.000 0.097 0.107
.eat10 0.334 0.030 11.119 3100.033 0.000 0.078 0.085
.eat11 0.299 0.027 11.069 Inf 0.000 0.043 0.045
.eat12 0.542 0.043 12.546 1229.291 0.000 0.124 0.142
.eat14 0.234 0.026 9.142 Inf 0.000 0.042 0.044
.eat24 0.610 0.048 12.781 636.995 0.000 0.173 0.209
.eat3 0.412 0.042 9.853 1556.282 0.000 0.110 0.124
.eat18 0.465 0.044 10.516 724.078 0.000 0.162 0.193
.eat21 0.269 0.039 6.852 869.260 0.000 0.148 0.173
drive 1.000
pre 1.000
fitMeasures(miCfa, stat = "D3")
chisq df pvalue ariv
46.156 34.000 0.080 0.189
fmi npar ntotal logl
0.159 31.000 400.000 -4676.490
unrestricted.logl aic bic bic2
-4649.044 9414.979 9538.715 9440.350
baseline.chisq baseline.df baseline.pvalue cfi
1741.955 45.000 0.000 0.993
rni nnfi tli rfi
0.993 0.991 0.991 0.965
nfi pnfi ifi mfi
0.974 0.736 0.993 0.985
rmsea rmsea.ci.lower rmsea.ci.upper rmsea.pvalue
0.030 0.000 0.050 0.950
gammaHat adjGammaHat rmr rmr_nomean
0.994 0.990 0.030 0.033
srmr srmr_bentler srmr_bentler_nomean crmr
0.030 0.030 0.033 0.033
crmr_nomean srmr_mplus srmr_mplus_nomean
0.036 0.030 0.033
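Under the hood, the pooled estimates, SEs, FMIs, and RIVs in the summary above follow Rubin's (1987) rules. The following base-R sketch shows the arithmetic for a single parameter; the estimates and SEs below are made-up numbers for illustration, not values from our model:

```r
## Hypothetical estimates and SEs of one parameter from M = 5 imputed datasets:
est <- c(0.61, 0.63, 0.60, 0.64, 0.62)
se  <- c(0.050, 0.052, 0.049, 0.051, 0.050)
m   <- length(est)

## Rubin's rules:
qBar <- mean(est)          # pooled point estimate
w    <- mean(se^2)         # within-imputation variance
b    <- var(est)           # between-imputation variance
tot  <- w + (1 + 1/m) * b  # total pooled sampling variance

pooledSE <- sqrt(tot)

## Relative increase in variance and (large-sample) fraction of missing information:
riv    <- (1 + 1/m) * b / w
fmiHat <- riv / (1 + riv)
```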
Now, we will estimate a latent regression model wherein the preoccupation with food factor and the drive for thinness factor are correlated and regressed onto anxiety, Western beauty standards, and BMI.
Define the model syntax for the SEM described above.
## Add the latent regression paths to the CFA syntax:
semMod <- paste(cfaMod,
'pre ~ b1p * anx + b2p * wsb + bmi',
'drive ~ b1d * anx + b2d * wsb + bmi',
sep = '\n')
Estimate the SEM on the multiply imputed data.
HINT: The semTools::sem.mi() function can be a big help here.
miSem1 <- sem.mi(semMod, data = eaImp, std.lv = TRUE)
Summarize the fitted SEM and interpret the results.
summary(miSem1, fmi = TRUE) %>% print()
The RIV will exceed 1 whenever between-imputation variance exceeds within-imputation variance (when FMI(1) > 50%).
lavaan.mi object based on 20 imputed data sets.
See class?lavaan.mi help page for available methods.
Convergence information:
The model converged on 20 imputed data sets
Rubin's (1987) rules were used to pool point and SE estimates across 20 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err t-value df P(>|t|) FMI RIV
drive =~
eat1 0.558 0.039 14.390 5413.411 0.000 0.059 0.063
eat2 0.496 0.036 13.924 Inf 0.000 0.038 0.039
eat10 0.590 0.034 17.272 2261.225 0.000 0.092 0.101
eat11 0.572 0.032 17.716 6730.312 0.000 0.053 0.056
eat12 0.501 0.036 13.894 2425.844 0.000 0.089 0.097
eat14 0.667 0.034 19.493 6530.932 0.000 0.054 0.057
eat24 0.469 0.037 12.735 3366.363 0.000 0.075 0.081
pre =~
eat3 0.629 0.040 15.697 2596.048 0.000 0.086 0.094
eat18 0.614 0.041 15.068 2682.808 0.000 0.084 0.092
eat21 0.684 0.040 17.197 1208.616 0.000 0.125 0.143
Regressions:
Estimate Std.Err t-value df P(>|t|) FMI RIV
pre ~
anx (b1p) 0.185 0.022 8.363 2533.868 0.000 0.087 0.095
wsb (b2p) 0.120 0.033 3.668 966.249 0.000 0.140 0.163
bmi 0.094 0.023 4.112 Inf 0.000 0.029 0.030
drive ~
anx (b1d) 0.194 0.021 9.228 2365.604 0.000 0.090 0.098
wsb (b2d) 0.178 0.032 5.610 2333.176 0.000 0.090 0.099
bmi 0.138 0.022 6.178 Inf 0.000 0.031 0.032
Covariances:
Estimate Std.Err t-value df P(>|t|) FMI RIV
.drive ~~
.pre 0.381 0.057 6.665 4408.903 0.000 0.066 0.070
Variances:
Estimate Std.Err t-value df P(>|t|) FMI RIV
.eat1 0.592 0.048 12.376 2757.270 0.000 0.083 0.091
.eat2 0.517 0.041 12.484 1886.923 0.000 0.100 0.112
.eat10 0.353 0.031 11.349 3248.520 0.000 0.076 0.083
.eat11 0.299 0.027 11.106 Inf 0.000 0.043 0.045
.eat12 0.532 0.043 12.490 1349.385 0.000 0.119 0.135
.eat14 0.253 0.026 9.644 Inf 0.000 0.040 0.041
.eat24 0.596 0.047 12.715 633.051 0.000 0.173 0.210
.eat3 0.403 0.041 9.850 1498.585 0.000 0.113 0.127
.eat18 0.453 0.043 10.447 778.068 0.000 0.156 0.185
.eat21 0.288 0.038 7.564 1162.639 0.000 0.128 0.147
.drive 1.000
.pre 1.000
| Parameter | Estimate | t | p | FMI |
|---|---|---|---|---|
| pre ~ anx | 0.185 | 8.363 | < 0.001 | 0.087 |
| pre ~ wsb | 0.120 | 3.668 | < 0.001 | 0.140 |
| drive ~ anx | 0.194 | 9.228 | < 0.001 | 0.090 |
| drive ~ wsb | 0.178 | 5.610 | < 0.001 | 0.090 |
Given an SEM fitted to multiply imputed data via semTools::lavaan.mi(), we
can use the testing utilities from semTools to evaluate (multivariate)
hypotheses defined in terms of parameter constraints.
Apply a multivariate Wald test to the fitted model from 3.1.4.
HINT: The semTools::lavTestWald.mi() function can be very helpful here.
## Define the constraints implied by the null hypothesis:
cons <- '
b1p == b1d
b2p == b2d
'
## Test the constraints via a multivariate Wald test:
lavTestWald.mi(miSem1, cons, test = "D1")
F df1 df2 pvalue ariv fmi
1.227 2.000 6923.960 0.293 0.071 0.067
The Wald test returned a nonsignificant result ( \(F[2, 6923.96] = 1.23\), \(p = 0.293\), \(FMI = 0.067\) ). Hence, we cannot infer a difference in the effect of either predictor on either outcome.
Test the hypothesis from 3.2.1 using a likelihood ratio test (LRT).
## Define the restricted model by combining the constraint and SEM syntax:
semMod2 <- paste(semMod, cons, sep = '\n')
## Estimate the restricted model:
miSem2 <- sem.mi(semMod2, data = eaImp, std.lv = TRUE)
## Conduct the LRT:
anova(miSem1, miSem2, stat = "D3")
F df1 df2 pvalue ariv fmi
1.236 2.000 5773.641 0.291 0.079 0.073
As above, the LRT returned a nonsignificant result ( \(F[2, 5773.64] = 1.24\), \(p = 0.291\), \(FMI = 0.073\) ). So, the restricted model does not fit significantly worse than the full model, and, consequently, we cannot infer a difference in the effect of either predictor on either outcome.
Next, we’ll consider methods for modeling mediation when working with MI data. To keep things simple, we’re only going to estimate a simple mediation model wherein Western beauty standards predict preoccupation with food, and drive for thinness mediates this relation.
Define the model syntax for the structural model.
## Define only the new structural parts:
semMod3 <- '
pre ~ b * drive + cp * wsb
drive ~ a * wsb
ab := a * b
'
## Add on the measurement part:
semMod3 <- paste(cfaMod, semMod3, sep = '\n')
Combining mediation with MI is a bit tricky. When treating the missing data with MI, we produce \(M\) different datasets, but we also need to create \(B\) bootstrap samples to test the indirect effects. The best method for combining these two layers of sampling is not immediately obvious, but the literature does provide some guidance. Wu and Jia (2013) recommend nesting bootstrapping within MI via an approach that they call MI(Boot). The MI(Boot) procedure entails the following steps.

1. Multiply impute the missing data to create \(M\) complete datasets.
2. Draw \(B\) bootstrap samples from each imputed dataset, and estimate the model on every bootstrap sample.
3. Pool the resulting \(M \times B\) estimates of the indirect effect into a single empirical sampling distribution, and use that distribution to construct the CI.
The MI(Boot) approach is not currently implemented in lavaan or semTools (neither is any alternative method of combining MI and bootstrapping). If you really wanted to, however, you could implement MI(Boot) manually.
First, we’ll define a function that will take a fitted lavaan object (lavObj)
as input and return the bootstrapped estimates of all defined parameters
specified in the model syntax of lavObj.
getDefParEstimates <- function(lavObj) {
## Use lavaan::lav_partable_constraints_def() to generate a function that
## computes all defined parameters represented in 'lavObj':
getDefPar <- lavObj %>% parTable() %>% lav_partable_constraints_def()
## Extract the bootstrapped coefficient estimates from 'lavObj':
cf <- inspect(lavObj, "coef.boot")
## Compute the defined parameters on each bootstrapped sample:
apply(cf, 1, getDefPar)
}
Now, we’ll use the semList() function from lavaan to fit semMod3 to each
of the M = 20 imputed datasets in eaImp.
We'll use the FUN argument to apply the getDefParEstimates() function defined above to extract the relevant estimates (i.e., the bootstrapped estimates of the indirect effect) from each one of our 20 fitted lavaan objects.

NOTE: This code will take quite a long time to run. So, if you want to run the code yourself, consider first doing so with a small number of bootstrap samples (e.g., \(B = 50\)) to get a sense of the run time to expect. You can time the smaller run with system.time() and extrapolate from there.

## Fit the models:
fits <- semList(semMod3,
dataList = eaImp,
std.lv = TRUE,
se = "boot",
bootstrap = 1000,
FUN = getDefParEstimates)
Finally, we aggregate the bootstrapped estimates of the indirect effect from
above (these live in the funList slot of the fits object) into a single
sample with \(N = 20 \times 1000 = 20000\) observations.
## Aggregate estimates:
ab <- fits@funList %>% unlist()
## Use the bca() function from the coxed package to compute the BCa CI for the
## indirect effect:
coxed::bca(ab)
[1] 0.1054825 0.2158876
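As a quick sanity check, we can also compute a simple percentile interval over the same aggregated bootstrap estimates; it should land close to the BCa interval above:

```r
## 95% percentile CI for the pooled bootstrapped indirect effects:
quantile(ab, probs = c(0.025, 0.975))
```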
The MI(Boot) approach will produce theoretically optimal results, but it does so at a high computational cost. Implementing the approach for lavaan models is also a bit tricky. Thankfully, we have a very satisfactory alternative: the Monte Carlo method proposed by MacKinnon et al. (2004).
The justification for the Monte Carlo method begins with the original motivation for using bootstrapping to test the indirect effect: the \(a\) and \(b\) paths are normally distributed, so their product cannot be. Rather than going fully non-parametric (as in bootstrapping), though, the Monte Carlo method leverages the known sampling distributions of \(a\) and \(b\) to synthesize an empirical sampling distribution of their product (i.e., the indirect effect). The Monte Carlo method entails the following procedure.

1. Draw many values of \(a^*\) and \(b^*\) from normal distributions with means equal to the estimated \(a\) and \(b\) paths and standard deviations equal to their standard errors.
2. Compute the product \(a^* b^*\) for each draw.
3. Use the empirical quantiles of the synthesized \(a^* b^*\) values to define a CI for the indirect effect.
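To make the procedure concrete before turning to the packaged implementation, here is a minimal base-R sketch of the Monte Carlo method for a single indirect effect. The point estimates and SEs below are hypothetical placeholders, not values taken from our fitted model:

```r
set.seed(235711)

## Hypothetical estimates and SEs of the 'a' and 'b' paths:
aHat <- 0.30; aSE <- 0.05
bHat <- 0.40; bSE <- 0.06

## Synthesize an empirical sampling distribution of the indirect effect:
nSim   <- 20000
aStar  <- rnorm(nSim, mean = aHat, sd = aSE)
bStar  <- rnorm(nSim, mean = bHat, sd = bSE)
abStar <- aStar * bStar

## 95% Monte Carlo (percentile) CI for a * b:
ci <- quantile(abStar, probs = c(0.025, 0.975))
ci
```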
Fortunately for us, the Monte Carlo simulation approach is implemented in the
monteCarloCI() function from semTools. For more information on this
approach, see MacKinnon et al. (2004) and
Preacher and Selig (2012).
Use the Monte Carlo simulation approach to test the above mediation hypothesis.
## Estimate the mediation model on the MI data:
miSem3 <- sem.mi(semMod3, data = eaImp, std.lv = TRUE)
## Compute Monte Carlo (percentile) CIs:
monteCarloCI(miSem3, plot = TRUE)
Yes, the indirect effect is statistically significant. It is also interesting to note that the Monte Carlo CI is very similar to the MI(Boot) CI estimated above.
Now that we’ve had our fun with MI, let’s explore the final topic of this lab: full information maximum likelihood (FIML).
Estimate the CFA defined in 1.3.1 using FIML.
fimlCfa1 <- cfa(cfaMod, data = ea, std.lv = TRUE, missing = "fiml")
To implement basic FIML estimation, we need only add the missing = "fiml"
option to our cfa() call.
Summarize the fitted model.
summary(fimlCfa1, fmi = TRUE)
lavaan 0.6-12 ended normally after 47 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 31
Number of observations 400
Number of missing patterns 31
Model Test User Model:
Test statistic 47.104
Degrees of freedom 34
P-value (Chi-square) 0.067
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) FMI
drive =~
eat1 0.741 0.050 14.763 0.000 0.068
eat2 0.650 0.045 14.383 0.000 0.022
eat10 0.807 0.043 18.926 0.000 0.042
eat11 0.764 0.040 19.207 0.000 0.006
eat12 0.662 0.047 14.056 0.000 0.085
eat14 0.901 0.041 21.788 0.000 0.005
eat24 0.623 0.048 12.876 0.000 0.093
pre =~
eat3 0.772 0.046 16.675 0.000 0.020
eat18 0.749 0.048 15.498 0.000 0.084
eat21 0.862 0.045 18.956 0.000 0.062
Covariances:
Estimate Std.Err z-value P(>|z|) FMI
drive ~~
pre 0.622 0.038 16.195 0.000 0.024
Intercepts:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 4.002 0.055 73.099 0.000 0.040
.eat2 3.934 0.050 79.135 0.000 0.032
.eat10 3.955 0.050 78.611 0.000 0.032
.eat11 3.938 0.047 83.777 0.000 -0.000
.eat12 3.926 0.051 77.399 0.000 0.051
.eat14 3.963 0.051 77.484 0.000 -0.000
.eat24 3.980 0.051 77.922 0.000 0.056
.eat3 3.968 0.050 78.900 0.000 -0.000
.eat18 3.974 0.052 77.082 0.000 0.046
.eat21 3.950 0.051 77.923 0.000 0.022
drive 0.000
pre 0.000
Variances:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 0.602 0.048 12.434 0.000 0.075
.eat2 0.534 0.042 12.707 0.000 0.059
.eat10 0.329 0.030 11.036 0.000 0.087
.eat11 0.300 0.026 11.426 0.000 0.023
.eat12 0.538 0.043 12.457 0.000 0.086
.eat14 0.235 0.025 9.362 0.000 0.062
.eat24 0.597 0.047 12.706 0.000 0.087
.eat3 0.416 0.041 10.020 0.000 0.058
.eat18 0.453 0.044 10.346 0.000 0.101
.eat21 0.262 0.039 6.663 0.000 0.088
drive 1.000
pre 1.000
fitMeasures(fimlCfa1)
npar fmin chisq df
31.000 0.059 47.104 34.000
pvalue baseline.chisq baseline.df baseline.pvalue
0.067 1949.598 45.000 0.000
cfi tli nnfi rfi
0.993 0.991 0.991 0.968
nfi pnfi ifi rni
0.976 0.737 0.993 0.993
logl unrestricted.logl aic bic
-4441.944 -4418.392 8945.888 9069.624
ntotal bic2 rmsea rmsea.ci.lower
400.000 8971.259 0.031 0.000
rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
0.051 0.940 0.029 0.032
srmr srmr_bentler srmr_bentler_nomean crmr
0.029 0.029 0.032 0.032
crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
0.035 0.029 0.032 413.723
cn_01 gfi agfi pgfi
477.059 0.997 0.994 0.521
mfi ecvi
0.984 0.273
The number of missing data patterns listed in the summary, 31, is smaller than the number of patterns we calculated from the full dataset, 46, because the model includes only the 10 eating attitudes items (and not bmi, anx, or wsb).
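These counts can be reproduced directly from the raw data by tabulating the distinct response indicator patterns (a small sketch; countPatterns is a hypothetical helper):

```r
## Count the distinct missing data patterns among a set of variables:
countPatterns <- function(x) nrow(unique(is.na(x)))

## Names of the 10 items used in the CFA:
items <- paste0("eat", c(1, 2, 10, 11, 12, 14, 24, 3, 18, 21))

countPatterns(ea[items]) # patterns among the modeled items (31 in the summary)
countPatterns(ea)        # patterns in the full dataset (46)
```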
Generally, the model looks good. The parameter estimates all seem sensible, and the model fits the data very well.
Although the model above seems to have produced satisfactory estimates, the MAR
assumption will be violated unless the missingness on the indicators of
drive and pre is associated only with other such indicators. If, for example,
missingness on some of the pre items is associated with BMI, then the data
are MNAR, and the model estimates are probably biased.
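We can never confirm MAR versus MNAR from the observed data alone, but we can at least probe whether missingness on an item is related to observed covariates. A minimal, purely illustrative sketch:

```r
## Does missingness on eat3 relate to BMI? (Rows missing bmi are dropped.)
missFit <- glm(is.na(eat3) ~ bmi, data = ea, family = binomial)
summary(missFit)$coefficients
```

A significant association would argue for including bmi as an auxiliary variable; a nonsignificant one cannot rule out MNAR.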
Rerun the CFA using bmi, anx, and wsb as auxiliary variables.
HINT: The semTools::cfa.auxiliary() function can be very helpful here.
fimlCfa2 <- cfa.auxiliary(cfaMod,
data = ea,
aux = c("bmi", "anx", "wsb"),
std.lv = TRUE)
NOTE: When you run this model, you will probably receive a warning message about a not positive definite residual covariance matrix. As you can see from the eigenvalues below, it is true that the residual covariance matrix is NPD.
## Extract the residual covariance matrix:
theta <- lavInspect(fimlCfa2, "theta")
## Indeed, we have a negative eigenvalue => NPD
eigen(theta)$values
[1] 12.2667170 6.8382415 3.3098964 0.6012041 0.5727693 0.5351208
[7] 0.4822731 0.4311151 0.3585560 0.3149530 0.2801749 0.2426013
[13] -1.7461727
However, I honestly cannot figure out what’s wrong. As you can see below, the residual covariance matrix itself looks fine (i.e., no Heywood cases, no extreme values, all covariances are legal according to the relevant triangle inequalities).
print(theta)
eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi anx wsb
eat1 0.604
eat2 0.000 0.535
eat10 0.000 0.000 0.328
eat11 0.000 0.000 0.000 0.300
eat12 0.000 0.000 0.000 0.000 0.538
eat14 0.000 0.000 0.000 0.000 0.000 0.234
eat24 0.000 0.000 0.000 0.000 0.000 0.000 0.599
eat3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.415
eat18 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.451
eat21 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.264
bmi 0.940 0.945 0.730 0.740 0.764 0.877 0.887 0.705 0.700 0.773 7.374
anx 1.435 1.211 1.046 1.191 1.194 1.268 1.256 1.379 1.351 1.279 1.244 9.200
wsb 0.605 0.485 0.503 0.569 0.486 0.652 0.541 0.403 0.523 0.541 1.118 1.049 3.646
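We can also check the triangle-inequality (i.e., Cauchy-Schwarz) claim numerically using the theta matrix extracted above (a quick sketch):

```r
## Every covariance must satisfy |cov(x, y)| <= sqrt(var(x) * var(y)):
s <- sqrt(diag(theta))
all(abs(theta) <= outer(s, s) + 1e-8) # TRUE if all covariances are legal
```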
If you want to check my sleuthing w.r.t. this issue, you can find the relevant code here. In the end, I’m willing to tentatively trust these estimates.
Summarize the CFA model and check the model fit.
summary(fimlCfa2, fmi = TRUE)
lavaan 0.6-12 ended normally after 102 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 70
Number of observations 400
Number of missing patterns 46
Model Test User Model:
Test statistic 49.044
Degrees of freedom 34
P-value (Chi-square) 0.046
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) FMI
drive =~
eat1 0.741 0.050 14.772 0.000 0.063
eat2 0.649 0.045 14.369 0.000 0.017
eat10 0.808 0.043 18.956 0.000 0.042
eat11 0.764 0.040 19.212 0.000 0.003
eat12 0.662 0.047 14.055 0.000 0.076
eat14 0.901 0.041 21.811 0.000 0.007
eat24 0.622 0.048 12.875 0.000 0.097
pre =~
eat3 0.772 0.046 16.685 0.000 0.027
eat18 0.751 0.048 15.558 0.000 0.091
eat21 0.862 0.046 18.917 0.000 0.064
Covariances:
Estimate Std.Err z-value P(>|z|) FMI
drive ~~
pre 0.620 0.039 16.078 0.000 0.027
bmi ~~
anx 1.244 0.426 2.917 0.004 0.042
wsb 1.118 0.275 4.065 0.000 0.065
anx ~~
wsb 1.049 0.311 3.370 0.001 0.094
bmi ~~
.eat1 0.940 0.157 5.987 0.000 0.056
anx ~~
.eat1 1.435 0.181 7.949 0.000 0.072
wsb ~~
.eat1 0.605 0.111 5.431 0.000 0.079
bmi ~~
.eat2 0.945 0.143 6.617 0.000 0.033
anx ~~
.eat2 1.211 0.160 7.592 0.000 0.045
wsb ~~
.eat2 0.485 0.101 4.783 0.000 0.098
bmi ~~
.eat10 0.730 0.142 5.150 0.000 0.038
anx ~~
.eat10 1.046 0.160 6.534 0.000 0.038
wsb ~~
.eat10 0.503 0.103 4.885 0.000 0.097
bmi ~~
.eat11 0.740 0.133 5.574 0.000 0.007
anx ~~
.eat11 1.191 0.154 7.733 0.000 0.028
wsb ~~
.eat11 0.569 0.096 5.953 0.000 0.036
bmi ~~
.eat12 0.764 0.143 5.334 0.000 0.059
anx ~~
.eat12 1.194 0.165 7.250 0.000 0.086
wsb ~~
.eat12 0.486 0.104 4.658 0.000 0.129
bmi ~~
.eat14 0.877 0.146 6.025 0.000 0.005
anx ~~
.eat14 1.268 0.169 7.485 0.000 0.043
wsb ~~
.eat14 0.652 0.106 6.135 0.000 0.065
bmi ~~
.eat24 0.887 0.149 5.967 0.000 0.106
anx ~~
.eat24 1.256 0.166 7.564 0.000 0.098
wsb ~~
.eat24 0.541 0.104 5.213 0.000 0.106
bmi ~~
.eat3 0.705 0.141 5.007 0.000 0.015
anx ~~
.eat3 1.379 0.169 8.179 0.000 0.054
wsb ~~
.eat3 0.403 0.100 4.006 0.000 0.037
bmi ~~
.eat18 0.700 0.145 4.843 0.000 0.074
anx ~~
.eat18 1.351 0.171 7.910 0.000 0.092
wsb ~~
.eat18 0.523 0.105 4.991 0.000 0.100
bmi ~~
.eat21 0.773 0.143 5.398 0.000 0.040
anx ~~
.eat21 1.279 0.169 7.579 0.000 0.078
wsb ~~
.eat21 0.541 0.104 5.183 0.000 0.085
Intercepts:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 4.004 0.055 73.123 0.000 0.038
.eat2 3.937 0.050 79.306 0.000 0.030
.eat10 3.953 0.050 78.551 0.000 0.032
.eat11 3.937 0.047 83.777 0.000 -0.000
.eat12 3.929 0.051 77.411 0.000 0.052
.eat14 3.962 0.051 77.484 0.000 -0.000
.eat24 3.986 0.051 78.125 0.000 0.053
.eat3 3.968 0.050 78.900 0.000 -0.000
.eat18 3.982 0.052 77.193 0.000 0.046
.eat21 3.952 0.051 77.906 0.000 0.022
drive 0.000
pre 0.000
bmi 22.406 0.136 164.171 0.000 0.010
anx 11.971 0.154 77.834 0.000 0.028
wsb 8.959 0.098 90.996 0.000 0.060
Variances:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 0.604 0.049 12.436 0.000 0.076
.eat2 0.535 0.042 12.716 0.000 0.056
.eat10 0.328 0.030 11.037 0.000 0.085
.eat11 0.300 0.026 11.428 0.000 0.022
.eat12 0.538 0.043 12.430 0.000 0.088
.eat14 0.234 0.025 9.347 0.000 0.060
.eat24 0.599 0.047 12.663 0.000 0.105
.eat3 0.415 0.041 10.009 0.000 0.058
.eat18 0.451 0.044 10.305 0.000 0.106
.eat21 0.264 0.039 6.712 0.000 0.088
drive 1.000
pre 1.000
bmi 7.374 0.526 14.012 0.000 0.013
anx 9.200 0.675 13.622 0.000 0.062
wsb 3.646 0.268 13.615 0.000 0.067
fitMeasures(fimlCfa2)
npar fmin chisq df
70.000 0.061 49.044 34.000
pvalue baseline.chisq baseline.df baseline.pvalue
0.046 1956.345 45.000 0.000
cfi tli nnfi rfi
0.992 0.990 0.990 0.967
nfi pnfi ifi rni
0.975 0.737 0.992 0.992
logl unrestricted.logl aic bic
-6957.674 -6933.152 14055.349 14334.751
ntotal bic2 rmsea rmsea.ci.lower
400.000 14112.637 0.033 0.005
rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
0.053 0.918 0.038 0.040
srmr srmr_bentler srmr_bentler_nomean crmr
0.025 0.025 0.027 0.027
crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
0.029 0.026 0.027 397.401
cn_01 gfi agfi pgfi
458.232 0.999 0.996 0.326
mfi ecvi
0.981 0.473
The results are all very similar, and this model looks equally good. The similarity of the results should not be too surprising when considering the relatively small amount of missing data. One interesting difference is that the number of missing data patterns given in the output now matches the number we computed for the entire dataset. Of course, this change is also natural since we are now using every variable from the EA data in the model.
Use FIML to estimate the latent regression model defined in 3.1.3.
fimlSem1 <- sem(semMod, data = ea, std.lv = TRUE, missing = "fiml")
Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan WARNING: 50 cases were deleted due to missing values in
exogenous variable(s), while fixed.x = TRUE.
As the warning indicates, some cases were deleted due to missing values. Why?
We’re treating the observed predictors as fixed (as is customary in OLS regression models, and as we did in the MI-based analysis). This choice caused no issues when analyzing the MI data because those predictors were completed in every imputed dataset. Now, however, the observed predictors enter the analysis in their incomplete state, and FIML cannot treat their missing data because fixed variables are not represented in the likelihood function.
Consequently, any observations with missing data on one of the observed predictor variables are deleted before estimating the model.
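We can confirm the deletion count by tallying the rows with missing values on any of the three observed predictors (a quick sketch):

```r
## Cases with missing data on at least one fixed predictor:
sum(!complete.cases(ea[c("bmi", "anx", "wsb")])) # 50, matching the warning
```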
Summarize the fitted SEM.
summary(fimlSem1, fmi = TRUE)
lavaan 0.6-12 ended normally after 99 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 37
Used Total
Number of observations 350 400
Number of missing patterns 28
Model Test User Model:
Test statistic 135.890
Degrees of freedom 58
P-value (Chi-square) 0.000
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) FMI
drive =~
eat1 0.544 0.040 13.523 0.000 0.048
eat2 0.497 0.037 13.579 0.000 0.029
eat10 0.615 0.036 16.933 0.000 -0.001
eat11 0.592 0.034 17.401 0.000 0.001
eat12 0.492 0.037 13.269 0.000 0.064
eat14 0.666 0.035 18.823 0.000 0.000
eat24 0.479 0.039 12.171 0.000 0.117
pre =~
eat3 0.629 0.042 15.111 0.000 0.023
eat18 0.620 0.042 14.608 0.000 0.104
eat21 0.696 0.042 16.470 0.000 0.029
Regressions:
Estimate Std.Err z-value P(>|z|) FMI
pre ~
anx (b1p) 0.177 0.023 7.778 0.000 0.006
wsb (b2p) 0.113 0.034 3.327 0.001 0.013
bmi 0.099 0.024 4.194 0.000 0.011
drive ~
anx (b1d) 0.191 0.022 8.800 0.000 0.003
wsb (b2d) 0.179 0.033 5.443 0.000 0.003
bmi 0.129 0.023 5.652 0.000 -0.000
Covariances:
Estimate Std.Err z-value P(>|z|) FMI
.drive ~~
.pre 0.389 0.058 6.679 0.000 0.029
Intercepts:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 0.289 0.369 0.784 0.433 0.036
.eat2 0.560 0.335 1.671 0.095 -0.012
.eat10 -0.207 0.358 -0.577 0.564 0.046
.eat11 -0.054 0.345 -0.155 0.877 -0.004
.eat12 0.565 0.334 1.690 0.091 0.024
.eat14 -0.555 0.371 -1.494 0.135 -0.003
.eat24 0.716 0.341 2.101 0.036 0.011
.eat3 0.581 0.380 1.529 0.126 0.004
.eat18 0.648 0.385 1.686 0.092 -0.006
.eat21 0.233 0.400 0.582 0.561 0.049
.drive 0.000
.pre 0.000
Variances:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 0.590 0.050 11.866 0.000 0.055
.eat2 0.508 0.043 11.855 0.000 0.067
.eat10 0.324 0.031 10.358 0.000 0.074
.eat11 0.301 0.028 10.723 0.000 0.015
.eat12 0.496 0.042 11.718 0.000 0.074
.eat14 0.245 0.026 9.333 0.000 0.031
.eat24 0.566 0.048 11.793 0.000 0.116
.eat3 0.415 0.043 9.721 0.000 0.055
.eat18 0.422 0.044 9.555 0.000 0.101
.eat21 0.258 0.039 6.697 0.000 0.071
.drive 1.000
.pre 1.000
## Latent regression estimates for FIML:
partSummary(fimlSem1, 8, fmi = TRUE)
Regressions:
Estimate Std.Err z-value P(>|z|) FMI
pre ~
anx (b1p) 0.177 0.023 7.778 0.000 0.006
wsb (b2p) 0.113 0.034 3.327 0.001 0.013
bmi 0.099 0.024 4.194 0.000 0.011
drive ~
anx (b1d) 0.191 0.022 8.800 0.000 0.003
wsb (b2d) 0.179 0.033 5.443 0.000 0.003
bmi 0.129 0.023 5.652 0.000 -0.000
## Latent regression estimates for MI:
partSummary(miSem1, 2, drops = "riv", fmi = TRUE)
Regressions:
Estimate Std.Err t-value df P(>|t|) FMI
pre ~
anx (b1p) 0.185 0.022 8.363 2533.868 0.000 0.087
wsb (b2p) 0.120 0.033 3.668 966.249 0.000 0.140
bmi 0.094 0.023 4.112 Inf 0.000 0.029
drive ~
anx (b1d) 0.194 0.021 9.228 2365.604 0.000 0.090
wsb (b2d) 0.178 0.032 5.610 2333.176 0.000 0.090
bmi 0.138 0.022 6.178 Inf 0.000 0.031
The parameter estimates are very similar to those produced by MI, and the inferential conclusions are all the same. The FMI estimates for the FIML-based regression slopes are quite a bit smaller than their MI-based counterparts, though.
Rerun the model from 4.3.1 with random predictors.
To model the predictors as random variables, we need only add the fixed.x = FALSE option to the sem() call.
fimlSem2 <- sem(semMod,
data = ea,
std.lv = TRUE,
missing = "fiml",
fixed.x = FALSE)
Summarize the fitted SEM from 4.3.3.
summary(fimlSem2, fmi = TRUE)
lavaan 0.6-12 ended normally after 114 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 46
Number of observations 400
Number of missing patterns 46
Model Test User Model:
Test statistic 127.166
Degrees of freedom 58
P-value (Chi-square) 0.000
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) FMI
drive =~
eat1 0.568 0.038 14.876 0.000 0.076
eat2 0.497 0.035 14.392 0.000 0.035
eat10 0.599 0.034 17.410 0.000 0.028
eat11 0.575 0.032 18.184 0.000 0.006
eat12 0.504 0.036 13.974 0.000 0.093
eat14 0.671 0.034 19.768 0.000 0.003
eat24 0.476 0.037 12.933 0.000 0.105
pre =~
eat3 0.631 0.039 16.142 0.000 0.030
eat18 0.614 0.040 15.210 0.000 0.108
eat21 0.690 0.040 17.120 0.000 0.040
Regressions:
Estimate Std.Err z-value P(>|z|) FMI
pre ~
anx (b1p) 0.183 0.022 8.285 0.000 0.066
wsb (b2p) 0.114 0.033 3.496 0.000 0.078
bmi 0.097 0.022 4.320 0.000 0.025
drive ~
anx (b1d) 0.195 0.021 9.359 0.000 0.045
wsb (b2d) 0.172 0.032 5.426 0.000 0.085
bmi 0.138 0.022 6.318 0.000 0.023
Covariances:
Estimate Std.Err z-value P(>|z|) FMI
.drive ~~
.pre 0.392 0.055 7.071 0.000 0.043
anx ~~
wsb 1.004 0.305 3.295 0.001 0.102
bmi 1.116 0.419 2.665 0.008 0.050
wsb ~~
bmi 1.110 0.273 4.065 0.000 0.066
Intercepts:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 0.052 0.362 0.143 0.886 0.035
.eat2 0.481 0.323 1.489 0.136 0.013
.eat10 -0.207 0.339 -0.611 0.541 0.059
.eat11 -0.060 0.322 -0.186 0.852 0.030
.eat12 0.424 0.329 1.287 0.198 0.039
.eat14 -0.700 0.358 -1.956 0.051 0.034
.eat24 0.672 0.327 2.055 0.040 0.031
.eat3 0.579 0.363 1.596 0.110 0.028
.eat18 0.681 0.363 1.876 0.061 0.004
.eat21 0.245 0.383 0.639 0.523 0.070
anx 11.960 0.152 78.529 0.000 0.032
wsb 8.959 0.098 91.276 0.000 0.059
bmi 22.400 0.136 164.413 0.000 0.011
.drive 0.000
.pre 0.000
Variances:
Estimate Std.Err z-value P(>|z|) FMI
.eat1 0.582 0.047 12.357 0.000 0.079
.eat2 0.521 0.041 12.662 0.000 0.058
.eat10 0.346 0.031 11.268 0.000 0.088
.eat11 0.300 0.026 11.512 0.000 0.023
.eat12 0.528 0.042 12.442 0.000 0.085
.eat14 0.252 0.026 9.856 0.000 0.050
.eat24 0.584 0.046 12.673 0.000 0.101
.eat3 0.406 0.040 10.031 0.000 0.060
.eat18 0.442 0.043 10.325 0.000 0.098
.eat21 0.282 0.038 7.381 0.000 0.097
.drive 1.000
.pre 1.000
anx 8.983 0.652 13.768 0.000 0.063
wsb 3.627 0.265 13.682 0.000 0.067
bmi 7.346 0.522 14.069 0.000 0.011
Naturally, we can observe a few differences when modeling the predictors as random variables: no cases are deleted (all 400 observations are used), and the predictors’ means, variances, and covariances are now estimated as model parameters.
Apart from the obviously problematic deletion of the incomplete cases when \(X\) is treated as fixed, we cannot really say that one approach is better, or more correct, than the other. Treating the predictors as fixed versus random is not simply an estimation choice (as, for example, selecting a different optimization algorithm or adjusting the convergence criteria would be). When we model \(X\) as random, we are specifying a fundamentally different model than we are when \(X\) is treated as fixed. In the former case, our latent regression model summarizes the joint distribution of \(Y\) and \(X\). In the latter case, \(X\) has no distribution and simply serves to define the mean of \(Y\). Neither of these is the “correct” way to represent the data; they’re just different. As researchers, we need to be aware of the model we’re estimating, and ensure that said model aligns with our inferential objectives.
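The structural difference between the two specifications shows up directly in the parameter counts of the two fitted models (a quick sketch):

```r
## Fixed X: the predictors' moments are not model parameters.
fitMeasures(fimlSem1, "npar") # 37
## Random X: adds the predictors' 3 means, 3 variances, and 3 covariances.
fitMeasures(fimlSem2, "npar") # 46
```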
Use FIML to estimate the mediation model defined in 3.3.1. Include bmi and anx as auxiliary variables.
NOTE: This code will take a long time to run. You may want to consider using the run timing strategies discussed in the MI(Boot) section.
## Estimate the mediation model using FIML with auxiliaries and bootstrapping:
fimlSem3 <- sem.auxiliary(semMod3,
data = ea,
aux = c("bmi", "anx"),
std.lv = TRUE,
se = "boot",
bootstrap = 2000)
Summarize the fitted mediation model and interpret the results.
summary(fimlSem3, fmi = TRUE)
lavaan 0.6-12 ended normally after 111 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 62
Number of observations 400
Number of missing patterns 46
Model Test User Model:
Test statistic 55.854
Degrees of freedom 42
P-value (Chi-square) 0.075
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 2000
Number of successful bootstrap draws 2000
Latent Variables:
Estimate Std.Err z-value P(>|z|)
drive =~
eat1 0.685 0.046 14.820 0.000
eat2 0.600 0.044 13.641 0.000
eat10 0.744 0.049 15.090 0.000
eat11 0.706 0.044 15.910 0.000
eat12 0.612 0.049 12.382 0.000
eat14 0.832 0.044 18.880 0.000
eat24 0.577 0.046 12.432 0.000
pre =~
eat3 0.602 0.044 13.709 0.000
eat18 0.588 0.045 13.035 0.000
eat21 0.675 0.044 15.372 0.000
Regressions:
Estimate Std.Err z-value P(>|z|)
pre ~
drive (b) 0.699 0.086 8.102 0.000
wsb (cp) 0.049 0.034 1.427 0.154
drive ~
wsb (a) 0.219 0.032 6.760 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
bmi ~~
anx 1.239 0.407 3.048 0.002
.eat1 0.780 0.139 5.602 0.000
anx ~~
.eat1 1.302 0.175 7.437 0.000
bmi ~~
.eat2 0.806 0.135 5.979 0.000
anx ~~
.eat2 1.091 0.164 6.659 0.000
bmi ~~
.eat10 0.565 0.132 4.277 0.000
anx ~~
.eat10 0.884 0.167 5.311 0.000
bmi ~~
.eat11 0.579 0.124 4.686 0.000
anx ~~
.eat11 1.052 0.150 7.015 0.000
bmi ~~
.eat12 0.624 0.120 5.200 0.000
anx ~~
.eat12 1.073 0.180 5.974 0.000
bmi ~~
.eat14 0.688 0.130 5.300 0.000
anx ~~
.eat14 1.100 0.172 6.402 0.000
bmi ~~
.eat24 0.746 0.130 5.731 0.000
anx ~~
.eat24 1.148 0.157 7.305 0.000
bmi ~~
.eat3 0.579 0.147 3.935 0.000
anx ~~
.eat3 1.261 0.163 7.736 0.000
bmi ~~
.eat18 0.564 0.146 3.853 0.000
anx ~~
.eat18 1.247 0.164 7.587 0.000
bmi ~~
.eat21 0.626 0.138 4.535 0.000
anx ~~
.eat21 1.157 0.156 7.407 0.000
bmi ~~
wsb 1.052 0.276 3.809 0.000
anx ~~
wsb 0.927 0.312 2.974 0.003
Intercepts:
Estimate Std.Err z-value P(>|z|)
.eat1 2.660 0.204 13.055 0.000
.eat2 2.762 0.183 15.124 0.000
.eat10 2.494 0.214 11.655 0.000
.eat11 2.554 0.209 12.213 0.000
.eat12 2.731 0.190 14.371 0.000
.eat14 2.332 0.234 9.987 0.000
.eat24 2.854 0.192 14.837 0.000
.eat3 2.879 0.191 15.084 0.000
.eat18 2.919 0.201 14.555 0.000
.eat21 2.733 0.217 12.603 0.000
wsb 8.958 0.097 92.391 0.000
.drive 0.000
.pre 0.000
bmi 22.406 0.139 161.330 0.000
anx 11.971 0.153 78.194 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.eat1 0.602 0.068 8.905 0.000
.eat2 0.534 0.053 10.071 0.000
.eat10 0.332 0.039 8.543 0.000
.eat11 0.299 0.027 11.227 0.000
.eat12 0.538 0.058 9.335 0.000
.eat14 0.234 0.029 8.168 0.000
.eat24 0.596 0.048 12.509 0.000
.eat3 0.418 0.043 9.814 0.000
.eat18 0.450 0.050 9.073 0.000
.eat21 0.262 0.042 6.199 0.000
.drive 1.000
.pre 1.000
wsb 3.627 0.252 14.369 0.000
bmi 7.360 0.497 14.796 0.000
anx 9.225 0.685 13.475 0.000
Defined Parameters:
Estimate Std.Err z-value P(>|z|)
ab 0.153 0.027 5.563 0.000
## Compute the BCa CI for the indirect effect:
parameterEstimates(fimlSem3, boot.ci.type = "bca.simple") %>%
  select(-(1:3)) %>%
  tail(1)
End of Lab 6